2015-02-09 Creative Commons License

Data analysis is often an afterthought, but it shouldn't be!

After this class students will be able to:

  • distinguish between the two main purposes of visualizations
  • explain the layered grammar of graphics
  • be excited to struggle with ggplot2

Why are we doing this?

Visualization is important both for

  • learning about your data
  • communicating the results of your analyses

Models and tables are fine, but I often find that visualizations are more helpful for understanding what is going on (plus they make your presentations look WAY cooler).

Examples

visualizations for learning

visualizations for communicating

  • Green dots show the 30-year average of the new PAGES 2k reconstruction
  • Red curve shows the global mean temperature, according to HadCRUT4 data from 1850 onwards
  • Blue curve is the original hockey stick of Mann, Bradley and Hughes (1999) with its uncertainty range (light blue).
  • Graph by Klaus Bitterman

By Healy

By Kenworthy

By Jackman

  • When was the last time you saw a table of regression results in a talk?
  • When was the last time you saw a graph like these in a talk?
  • Which do you want to have in your talk?

making graphs

graphing in R

  • base graphics by Ross Ihaka: model is paper and pencil (no grammar)
  • grid graphics by Paul Murrell
  • lattice graphics by Deepayan Sarkar
  • ggplot2 by Hadley Wickham (based on Wilkinson)

note on coding error

Thank you to Kieran Healy for being open about this so that we can all learn from it

corrplot(c.mat, method="shade", shade.col=NA, tl.col="black",
         order="hclust", hclust.method="ward", tl.srt=45)

corrplot(c.mat,add=TRUE, type="lower", method="number",
         order="AOE", diag=FALSE, tl.pos="n", cl.pos="n")

How could you write this differently?

Don't repeat yourself

data.to.plot <- c.mat
kHclustMethod <- "ward"
kOrder <- "hclust"

corrplot(data.to.plot, method = "shade", shade.col = NA, tl.col = "black",
         order = kOrder, hclust.method = kHclustMethod, 
         tl.srt = 45)

corrplot(data.to.plot, add = TRUE, type = "lower", method = "number",
         order = kOrder, hclust.method = kHclustMethod, 
         diag = FALSE, tl.pos = "n", cl.pos = "n")
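
Another way to avoid the repetition, offered here as a sketch rather than as the slides' own solution, is to wrap the shared arguments in a small helper so that both calls are forced to use the same ordering. PlotCorr is a made-up name, cor(mtcars) stands in for c.mat, and "ward.D" is used because R 3.1.0 and later renamed the old "ward" method.

```r
library(corrplot)

# Stand-in correlation matrix (assumption): built-in mtcars instead of the original c.mat
c.mat <- cor(mtcars)

# Hypothetical helper: the ordering arguments live in exactly one place
PlotCorr <- function(corr, ...) {
  corrplot(corr, order = "hclust", hclust.method = "ward.D", ...)
}

# Shaded background layer
PlotCorr(c.mat, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45)

# Numbers overlaid on the lower triangle, using the same ordering as above
PlotCorr(c.mat, add = TRUE, type = "lower", method = "number",
         diag = FALSE, tl.pos = "n", cl.pos = "n")
```

With the ordering fixed inside the helper, the second call cannot silently fall back to a different order than the first.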

ggplot2

Why ggplot2?

  • based on a grammar
  • excellent faceting (especially compared to base R)
  • integrates well with dplyr and the tidy data philosophy
  • lots of add-ons, such as GGally and ggmap
  • natural transition to ggvis, the next big thing
  • strong and active community

New York 311 data, Nicole Pangborn

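library(ggmap)
library(ggplot2)
# Fetch a base map of New York City (get_map() needs network access)
NYC <- get_map(location = "new york, new york", zoom = 11)
p <- ggmap(NYC)
# Supply the data frame once and map its columns by name, rather than using `$` inside aes()
p + geom_point(data = dfcalls_small, aes(x = Longitude, y = Latitude)) +
  labs(title = "311 calls in NYC, 1/28/15 - 1/29/15") +
  theme(plot.title = element_text(size = rel(2)))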

suppressPackageStartupMessages(library(dplyr))
library(ggplot2)
packageVersion("ggplot2")
## [1] '1.0.1'

More on ggplot2 version 1.0

Basic components:

  • default dataset and set of mappings from variables to aesthetics
  • one or more layers

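# Read the country-level data (header = TRUE and sep = "," are also the read.csv() defaults)
world.pop.data <- read.csv("data/wdata.csv", header = TRUE, sep = ",")
# Wrap as a dplyr tbl so that printing shows only the first rows
world.pop.data <- tbl_df(world.pop.data)
glimpse(world.pop.data)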
## Observations: 158
## Variables:
## $ country (fctr) Algeria, Egypt, Libya, Morocco, South Sudan, Sudan, T...
## $ pop2012 (dbl) 37.4, 82.3, 6.5, 32.6, 9.4, 33.5, 10.8, 9.4, 17.5, 20....
## $ imr     (int) 24, 24, 14, 30, 101, 67, 20, 81, 65, 73, 70, 47, 89, 1...
## $ tfr     (dbl) 2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6, 4.9,...
## $ le      (int) 73, 72, 75, 72, 52, 60, 75, 56, 55, 55, 58, 64, 54, 48...
## $ leM     (int) 72, 70, 72, 70, 50, 58, 73, 54, 54, 54, 57, 63, 52, 47...
## $ leF     (int) 75, 74, 77, 74, 53, 62, 77, 58, 56, 56, 59, 65, 55, 50...
## $ region  (fctr) Northern Africa, Northern Africa, Northern Africa, No...
## $ area    (fctr) Africa, Africa, Africa, Africa, Africa, Africa, Afric...

world.pop.data
## Source: local data frame [158 x 9]
## 
##          country pop2012 imr tfr le leM leF          region   area
## 1        Algeria    37.4  24 2.9 73  72  75 Northern Africa Africa
## 2          Egypt    82.3  24 2.9 72  70  74 Northern Africa Africa
## 3          Libya     6.5  14 2.6 75  72  77 Northern Africa Africa
## 4        Morocco    32.6  30 2.3 72  70  74 Northern Africa Africa
## 5    South Sudan     9.4 101 5.4 52  50  53 Northern Africa Africa
## 6          Sudan    33.5  67 4.2 60  58  62 Northern Africa Africa
## 7        Tunisia    10.8  20 2.1 75  73  77 Northern Africa Africa
## 8          Benin     9.4  81 5.4 56  54  58  Western Africa Africa
## 9   Burkina Faso    17.5  65 6.0 55  54  56  Western Africa Africa
## 10 Cote d'Ivoire    20.6  73 4.6 55  54  56  Western Africa Africa
## ..           ...     ... ... ... .. ... ...             ...    ...

All together, the layered grammar defines a plot as the combination of:

  • A default dataset and set of mappings from variables to aesthetics.
  • One or more layers, each composed of a geometric object, a statistical transformation, and a position adjustment, and optionally, a dataset and aesthetic mappings.
  • One scale for each aesthetic mapping.
  • A coordinate system.
  • The faceting specification.

Wickham (2009)
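
As a rough sketch of how those five components map onto ggplot2 code, the example below spells each one out explicitly. It assumes a ggplot2 release from 2.0.0 onward and uses the shorthand geom_point() rather than a bare layer() call. The data frame sample.pop is a stand-in that rebuilds only the ten rows of world.pop.data printed above; because all ten are in Africa, region is used where the slides use area. The axis labels assume that le is life expectancy and tfr is total fertility rate, which the slides do not spell out.

```r
library(ggplot2)

# Stand-in data (assumption): the ten printed rows of world.pop.data, not the full file
sample.pop <- data.frame(
  country = c("Algeria", "Egypt", "Libya", "Morocco", "South Sudan", "Sudan",
              "Tunisia", "Benin", "Burkina Faso", "Cote d'Ivoire"),
  pop2012 = c(37.4, 82.3, 6.5, 32.6, 9.4, 33.5, 10.8, 9.4, 17.5, 20.6),
  tfr     = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6),
  le      = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  region  = rep(c("Northern Africa", "Western Africa"), times = c(7, 3))
)

grammar.plot <-
  ggplot(data = sample.pop,                               # 1. default dataset ...
         aes(x = le, y = tfr, color = region)) +          #    ... and aesthetic mappings
  geom_point(stat = "identity", position = "identity") +  # 2. a layer: geom + stat + position
  scale_x_continuous(name = "Life expectancy") +          # 3. one scale per mapped aesthetic
  scale_y_continuous(name = "Total fertility rate") +
  scale_color_discrete(name = "Region") +
  coord_cartesian() +                                     # 4. coordinate system
  facet_null()                                            # 5. faceting specification (none)

print(grammar.plot)
```

Everything from the scales down is what ggplot2 would pick by default for this plot; writing it out only makes the grammar visible.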

Basic plot

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point")
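
The bare layer(geom = "point") call matches the ggplot2 1.0.1 release loaded earlier. To my knowledge, releases from 2.0.0 onward require layer() to be given a stat and a position as well, so on a newer install the same plot is usually written with the geom_point() shorthand. A minimal sketch, using a stand-in data frame (sample.pop, an assumed name) built from the ten printed rows of world.pop.data:

```r
library(ggplot2)

# Stand-in data (assumption): le and tfr from the ten printed rows of world.pop.data
sample.pop <- data.frame(
  le  = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  tfr = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6)
)

p <- ggplot(data = sample.pop, aes(x = le, y = tfr))

# Fully spelled-out layer: geom, stat, and position are all given explicitly
p.explicit <- p + layer(geom = "point", stat = "identity", position = "identity")

# Shorthand that fills in the same stat and position for you
p.short <- p + geom_point()

print(p.short)
```

Both objects describe the same scatterplot; the shorthand is simply less typing.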

adding aesthetics

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr, color = area))
p + layer(geom = "point")

adding aesthetics

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr, color = area, size = pop2012))
p + layer(geom = "point")

adding aesthetics

p <- ggplot(data = world.pop.data, 
            aes(x = tfr, y = le, color = area, size = pop2012))
p + layer(geom = "point")
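
One way to see what mapping a variable to an aesthetic does is to look at the data ggplot2 builds for the layer: each mapped aesthetic becomes a column of values that the geom can draw. The sketch below assumes ggplot2 2.0.0 or later (for geom_point() and layer_data()) and uses the ten printed rows as a stand-in called sample.pop, with region in place of area because all ten rows are in Africa.

```r
library(ggplot2)

# Stand-in data (assumption): the ten printed rows of world.pop.data
sample.pop <- data.frame(
  pop2012 = c(37.4, 82.3, 6.5, 32.6, 9.4, 33.5, 10.8, 9.4, 17.5, 20.6),
  tfr     = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6),
  le      = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  region  = rep(c("Northern Africa", "Western Africa"), times = c(7, 3))
)

p <- ggplot(data = sample.pop,
            aes(x = le, y = tfr, color = region, size = pop2012)) +
  geom_point()

# Data as the point layer sees it, after the scales have been applied
built <- layer_data(p, 1)

# region has become a colour per point, pop2012 a point size
head(built[, c("x", "y", "colour", "size")])

# Number of distinct colours equals the number of regions in the stand-in data
length(unique(built$colour))
```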

Basic plot

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point")

changing the geom

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "line")

changing the geom

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "blank")

Basic plot

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point")

adding a stat layer

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth")

adding a stat layer

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "loess")

adding a stat layer

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "lm")

adding a stat layer

p <- ggplot(data = world.pop.data, 
            aes(x = le, y = tfr))
p + layer(geom = "point") + layer(stat = "smooth", method = "lm", se = FALSE)
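
On later ggplot2 releases the smoother is usually added with geom_smooth() (or stat_smooth()) rather than a bare layer() call. The sketch below uses the ten printed rows as stand-in data and compares the line that gets drawn with a model fitted directly by lm(); sample.pop, fit, and smooth.data are names made up for the illustration.

```r
library(ggplot2)

# Stand-in data (assumption): le and tfr from the ten printed rows of world.pop.data
sample.pop <- data.frame(
  le  = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  tfr = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6)
)

p <- ggplot(data = sample.pop, aes(x = le, y = tfr)) +
  geom_point() +
  # straight-line fit, no confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
print(p)

# The same least-squares fit, done by hand
fit <- lm(tfr ~ le, data = sample.pop)
coef(fit)

# The y values drawn by the smooth layer should agree with predict() on the same x grid
smooth.data <- layer_data(p, 2)
all.equal(smooth.data$y,
          unname(predict(fit, newdata = data.frame(le = smooth.data$x))))
```

Seen this way, method = "lm" is a regression of y on x drawn on top of the points, and se = FALSE only hides the confidence band.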

Basic plot

p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))

adding faceting

p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(area ~ .)  

adding faceting

p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(. ~ area)  

adding faceting

p <- ggplot(data = world.pop.data, aes(x = le, y = tfr))
p + layer(geom = "point") + facet_grid(. ~ area) +
  layer(stat = "smooth", method = "loess")
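
A sketch of the faceting step for ggplot2 2.0.0 or later, again on the ten printed rows as stand-in data. The formula facet_grid(. ~ region) lays the panels out in columns and facet_grid(region ~ .) in rows. Two substitutions are assumptions made for the small sample: region replaces area because all ten rows are in Africa, and lm replaces loess because three points in a panel are probably too few for a loess fit.

```r
library(ggplot2)

# Stand-in data (assumption): the ten printed rows of world.pop.data
sample.pop <- data.frame(
  tfr    = c(2.9, 2.9, 2.6, 2.3, 5.4, 4.2, 2.1, 5.4, 6.0, 4.6),
  le     = c(73, 72, 75, 72, 52, 60, 75, 56, 55, 55),
  region = rep(c("Northern Africa", "Western Africa"), times = c(7, 3))
)

p <- ggplot(data = sample.pop, aes(x = le, y = tfr)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)

p.cols <- p + facet_grid(. ~ region)  # one column of panels per region
p.rows <- p + facet_grid(region ~ .)  # one row of panels per region

print(p.cols)
print(p.rows)
```

Each panel gets its own points and its own fitted line, because the stat is computed separately within every facet.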

Questions about visualization and ggplot2?

One thing that you should know that most people don't talk about:

  • you will build and tweak your graphs over time; the first graph is never the final graph

Goal check

Review and motivation for next class

going forward

  • ggplot2 is confusing at first
  • but it will make you powerful

It just takes practice!